Machine learning of factors for improving oyster hatchery production
Our research focuses on investigating the variability in oyster hatchery production yields and identifying factors that influence these outcomes. By understanding the predictors of hatchery production outcomes, we aim to enhance the efficiency of oyster hatchery operations.
Improving the efficiency of hatchery production is crucial as it not only impacts profitability but also addresses concerns among growers regarding the availability of oyster seeds. To achieve this, we gathered production data from the Horn Point Laboratory Oyster Hatchery and collected relevant information on weather and water conditions near the hatchery.
Leveraging the power of machine learning, we employed predictive modeling techniques to identify key indicators of low and high aquaculture yields. Yield was defined as the percent of pediveliger larvae produced from the fertilized eggs initially stocked at the beginning of the production process. The models we developed can generate forecasts for specific conditions and provide valuable insights to hatcheries on implementing appropriate practices to enhance cost-effectiveness.
We are pleased to make our findings publicly available, encouraging comments and feedback from the research and oyster growers community. Should you have any thoughts or suggestions, please don’t hesitate to share them as a comment below or contact the authors.
1 Background
The success of oyster aquaculture, fisheries augmentation, and restoration in the Chesapeake Bay heavily relies on the hatchery production of larvae. Despite the expertise of skilled hatchery staff, shellfish hatcheries frequently encounter periods of suboptimal larval growth and inconsistent production levels. The Horn Point Laboratory Oyster Hatchery (HPLOH) is no exception, with highly variable yield rates, ranging from 5% to 17% on average, and occasional broods exceeding 50% yield (Gray et al. 2022).
In addition to reduced production levels, HPLOH and other hatcheries commonly face acute mortality events known as crashes. Crashes have been an ongoing issue within the global shellfish aquaculture industry (Walker 2017) and affect various cultivated shellfish species (Jones 2006). These crashes can occur suddenly, leading to the loss of entire batches of larvae overnight. Observations have revealed larvae covered in bacterial coatings or filled with ciliates, making it challenging to determine whether these factors caused the crash or opportunistic invaders took advantage of a compromised batch (Estes et al. 2004). Besides bacterial contamination, acute fluctuations in water quality, such as harmful algal blooms or abrupt changes in salinity or oxygen levels, can also disrupt culture conditions and critical components of the production process, including broodstock conditioning or algae cultures (Jones 2006).
To ensure suitable production conditions, hatcheries regularly monitor the quality of their source water. When conditions are suboptimal, facilities can make adjustments, such as adding salt, to mitigate adverse environmental effects and resume production. However, in many cases, the causes of crashes or production inefficiencies remain unidentified. Hatchery staff often lack the time and resources to investigate the root causes thoroughly. As a result, the most common approach to managing crashes is to flush out “bad” water, clean tanks, and hope that refilling them with “good” water will enable production to resume. Unfortunately, adverse conditions can persist for extended periods (Barton et al. 2012; Gray et al. 2022), severely limiting oyster production regionally and incurring significant financial losses for the industry (e.g., over $110 million during the “Seed Crisis” in the Pacific Northwest, Ekstrom et al. 2015). Therefore, there is a critical need to identify predictors for hatchery yield, particularly regarding low yield, to prevent costly crashes. Tools that enhance production reliability would increase the profitability of public and private hatcheries by improving resource utilization and minimizing waste. Moreover, ensuring a more reliable seed supply would alleviate concerns among growers in Maryland and beyond, creating greater opportunities for industry expansion.
2 Data
2.1 Oyster production data
We supplemented the long-term hatchery production data used by Gray et al. (2022) with more recent data from the HPLOH. The combined dataset spanned the years 2011 to 2021 and provided detailed information on broodstock, spawning, and larval production in the hatchery.
The primary focus of our study was the hatchery yield, which was defined as the percentage of pediveliger oyster larvae (i.e. eyed larvae) out of all the fertilized eggs added into a tank: \[ \text{Yield} = \frac{\text{Eyed larvae produced}}{\text{Eggs added to tank}} \times 100\%. \] To gain a clearer understanding of the factors influencing production outcomes, we specifically examined broods that were not mixed during the production process. Consequently, we did not include information on mixed broods in our analysis.
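The yield definition above can be expressed as a small helper (a Python sketch with made-up numbers purely for illustration; the study's own analysis was done in R):

```python
def hatchery_yield(eyed_larvae: float, eggs_added: float) -> float:
    """Percent of the fertilized eggs stocked in a tank that reach
    the eyed (pediveliger) larval stage."""
    if eggs_added <= 0:
        raise ValueError("eggs_added must be positive")
    return eyed_larvae / eggs_added * 100.0

# e.g., 1.2 million eyed larvae from 10 million stocked eggs
print(hatchery_yield(1.2e6, 10e6))  # 12.0
```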
During the data cleaning process, several measures were taken:
- All yields were set to zero when the HPLOH indicated that the brood was “dumped” due to a production crash.
- Records with HPLOH notes indicating “no eggs” (7 records) or those lacking information on the number of eyed larvae (9 records) were removed.
- Records with unusually fast production (fewer than 9 days to the first eyed oysters) were removed. We identified three such records; production this fast typically indicates a brood rushed for a specific buyer of larvae.
From the HPLOH records from different stages of the production process, we considered various predictors of yields:
- Broodstock collection:
- Name of the oyster bar of collected broodstock.
- Conditioning:
- Average conditioning pH (NBS units),
- Average conditioning salinity (ppt), and
- Average conditioning temperature (°C).
- Spawning:
- Average shell height of females (mm),
- Average shell height of males (mm),
- Fecundity (millions of eggs per female),
- Gonadal index (1 to 4), and
- Stimulation used (female / male / other / none).
- Larval culture:
- Week of year when production started (1 to 52) and
- Carbonate tank buffering (yes or no).
We narrowed down the list of predictors compared to Gray et al. (2022) to focus on identifying explanatory relationships that could be used to predict yields before the production process commenced. For instance, we did not include the year of broodstock collection as a predictor due to its limited explanatory value. Also, the percentage survival to the prodissoconch II stage (a.k.a. D-hinge) was not used, since this information becomes available only later in the production process.
To determine the date when each new brood’s tank was filled with ambient water, we used the date of the first drain and the corresponding oyster age as reported by the HPLOH: \[ \text{Date:time of first fill} = \text{Date:time of first drain} - \text{Age at the first drain (hours)}. \] This calculated date served as a reference point to extract information on environmental conditions near the hatchery.
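The back-calculation of the fill time can be sketched as follows (a Python illustration with a hypothetical brood; the original workflow used R):

```python
from datetime import datetime, timedelta

def first_fill_time(first_drain: datetime, age_at_drain_hours: float) -> datetime:
    """Back-calculate when a tank was filled from the date:time of the
    first drain and the reported larval age at that drain."""
    return first_drain - timedelta(hours=age_at_drain_hours)

# hypothetical brood: first drained on June 10 at 08:00, at an age of 48 hours
fill = first_fill_time(datetime(2021, 6, 10, 8, 0), 48)
print(fill)  # 2021-06-08 08:00:00
```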
2.2 Environmental data
We drew on several sources to obtain environmental variables recorded on the day of tank filling. To account for potential delayed effects on oyster hatchery production outcomes, these variables were used both in their original form and with lags of 1 to 7 days, excluding the variables representing winter conditions.
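Constructing the lagged versions of a daily environmental series can be sketched as follows (a Python/pandas illustration with invented salinity values; the study used R):

```python
import pandas as pd

# hypothetical daily series of an ambient variable, e.g., salinity (ppt)
env = pd.DataFrame(
    {"salinity": [10.1, 10.4, 9.8, 10.0, 10.6, 10.3, 10.2, 9.9]},
    index=pd.date_range("2021-06-01", periods=8, freq="D"),
)

# add lagged copies (1 to 7 days) so that conditions from the week before
# the fill date can also serve as predictors
for lag in range(1, 8):
    env[f"salinity_lag{lag}"] = env["salinity"].shift(lag)

# on 2021-06-08, salinity_lag7 holds the value observed on 2021-06-01
print(env.loc["2021-06-08", "salinity_lag7"])  # 10.1
```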
Goose’s Reef buoy
We used quality-controlled records from the Goose’s Reef buoy, which is part of the NOAA’s Chesapeake Bay Interpretive Buoy System (CBIBS). The water quality variables included:
- Chlorophyll concentration (ug/L),
- Dissolved oxygen concentration (mg/L),
- Salinity (PSU),
- Turbidity (NTU), and
- Water temperature (°C)
In addition to water quality, the buoy also recorded atmospheric conditions, including:
- Air temperature (°C),
- Wind direction (degrees clockwise from true north), and
- Wind speed (m/s).
Examining water quality effects on hatchery production is an intuitive avenue of investigation, but in an effort to build predictive tools that may eventually lead to an early warning system for the hatchery, it was important for us to also examine weather data and its effects on Choptank River water quality and hatchery production.
During the data cleaning process, the following steps were taken to ensure data quality and completeness:
- Retained only data with the highest quality codes (1 and 2).
- Removed records of chlorophyll, dissolved oxygen, salinity, and turbidity outside the range of 0–50.
- Aggregated the 10-minute data to obtain daily averages.
- Used simple linear regressions between the buoy data and corresponding data measured at the NOAA’s station CAMM2 in Cambridge, MD, to fill in missing information on daily air and water temperatures where possible (adjusted coefficient of determination \(R^2_{adj.} > 0.95\) in both cases).
- Used a simple linear regression between the buoy air temperature and air temperature measured at the Cambridge airport to fill in missing daily air temperatures where possible. (Linear regressions of Goose’s Reef wind speed and direction with corresponding observations measured at the station CAMM2 and the airport produced \(R^2_{adj.} < 0.7\) and consequently were not used to fill in the missing data.)
- Filled in the remaining missing data for the Goose's Reef buoy using the iterative random forest algorithm implemented in the R package missForest (Stekhoven 2022). This algorithm fit up to 100 random forests, each with 50 regression trees, and incorporated all available variables measured at the buoy, station CAMM2 (air and water temperatures, wind speed and direction), and the Cambridge airport (air temperature, wind speed and direction), along with the numeric year, month, and day of the year for time-series predictions.
By following these steps, we completed the dataset of environmental conditions directly measured near the HPLOH, incorporating both water quality and atmospheric information from the Goose’s Reef buoy.
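The regression-based gap filling in the steps above can be sketched as follows (a Python illustration with synthetic, exactly linear data; the names and values are hypothetical, and the study's own analysis used R):

```python
import numpy as np

def fill_by_regression(target, reference):
    """Fill gaps (NaNs) in `target` using a simple linear regression
    on an overlapping `reference` series (both 1-D daily arrays)."""
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ok = ~np.isnan(target) & ~np.isnan(reference)
    slope, intercept = np.polyfit(reference[ok], target[ok], deg=1)
    filled = target.copy()
    gaps = np.isnan(target) & ~np.isnan(reference)
    filled[gaps] = slope * reference[gaps] + intercept
    return filled

# buoy air temperature with one missing day, filled from a nearby station
buoy = np.array([20.0, 21.0, np.nan, 23.0, 24.0])
station = np.array([19.5, 20.5, 21.5, 22.5, 23.5])
print(fill_by_regression(buoy, station))  # the gap is filled with ~22.0
```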
Daymet gridded weather conditions
We used the R package daymetr (Hufkens 2023) to access the Daymet Version 4 dataset. Daymet offers gridded estimates of daily weather based on statistical modeling techniques that interpolate and extrapolate ground-based observations. For the 1 km \(\times\) 1 km grid corresponding to the location of the HPLOH, we obtained the following daily time series of weather variables:
- Precipitation (mm),
- Maximum temperature (°C), and
- Minimum temperature (°C).
Additionally, we calculated the daily average temperature (°C) by taking the average of the minimum and maximum temperatures.
Remote sensing products
We used Google Earth Engine pipelines to collect daily time series of the following variables:
- Normalized difference vegetation index (NDVI). We obtained NDVI measurements from both the Landsat-8 Operational Land Imager (OLI) and the Sentinel-2 MultiSpectral Instrument (MSI). The NDVI serves as an indicator of vegetation health and was used to estimate the potential impact of agricultural practices and other land uses on water quality. These measurements were aggregated across the larger Choptank River watershed and nearby subwatersheds. Specifically, we examined the changes (differences) in NDVI over various time intervals: 1, 2, 4, 8, 16, and 32 days. This resulted in seven NDVI variables that helped us understand the effects of agricultural practices, such as fertilizer and herbicide application.
- Chlorophyll. We also considered chlorophyll measurements within a 1-km radius of the HPLOH (inlet) and in the Choptank River subwatershed used for calculating the NDVI. These two remotely sensed chlorophyll variables provided insights into the levels of chlorophyll-a, which is an indicator of phytoplankton abundance and water quality.
Winter averages
To consider the influence of winter conditions preceding the hatchery production seasons, we averaged the following daily variables for January–February of each year:
- Air temperature (Goose’s Reef),
- Chlorophyll concentration (Goose’s Reef),
- Oxygen concentration (Goose’s Reef),
- Salinity (Goose’s Reef),
- Water temperature (Goose’s Reef),
- Precipitation (Daymet), and
- NDVI (remote sensing).
Overall, we considered 144 predictors, including the original 32 predictors and their lagged versions. Table 1 shows the statistical summaries, and Figure 1 shows pairwise plots of the original (not lagged) variables.
| Variable | unit | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| Yield | % | 615 | 10.097 | 9.474 | 0.000 | 56.359 | 56.359 | 0.382 |
| AvgShellHeightFemales | mm | 612 | 114.837 | 10.435 | 82.457 | 187.600 | 105.143 | 0.422 |
| AvgShellHeightMales | mm | 608 | 112.614 | 15.712 | 57.556 | 296.319 | 238.764 | 0.637 |
| Fecundity | millions of eggs/female | 614 | 13.685 | 7.371 | 0.000 | 48.100 | 48.100 | 0.297 |
| GonadalIndex | 1 to 4 | 556 | 2.552 | 0.259 | 1.500 | 3.250 | 1.750 | 0.011 |
| Spawn_pH | NBS units | 459 | 7.796 | 0.242 | 6.701 | 8.305 | 1.604 | 0.011 |
| Spawn_Salinity | ppt | 539 | 10.115 | 1.614 | 5.875 | 14.187 | 8.312 | 0.070 |
| Spawn_Temp | °C | 539 | 19.742 | 1.918 | 15.674 | 27.640 | 11.966 | 0.083 |
| Week | 1 to 52 | 615 | 24.418 | 7.254 | 9.000 | 39.000 | 30.000 | 0.292 |
| GR_AirTemp | °C | 615 | 20.645 | 6.709 | -0.568 | 30.541 | 31.109 | 0.271 |
| GR_Chlorophyll | ug/l | 615 | 8.866 | 5.968 | 0.619 | 47.057 | 46.438 | 0.241 |
| GR_Oxygen | mg/l | 615 | 9.505 | 1.905 | 4.772 | 14.635 | 9.863 | 0.077 |
| GR_Salinity | ppt | 615 | 10.685 | 2.433 | 2.596 | 17.059 | 14.463 | 0.098 |
| GR_Turbidity | NTU | 615 | 4.667 | 5.425 | 0.315 | 41.114 | 40.799 | 0.219 |
| GR_WaterTemp | °C | 615 | 21.326 | 7.058 | 0.369 | 30.342 | 29.973 | 0.285 |
| GR_WindDirection | degrees (1° to 360°) | 615 | 182.762 | 55.159 | 48.720 | 322.340 | 273.620 | 2.224 |
| GR_WindSpeed | m/s | 615 | 4.688 | 1.828 | 0.000 | 11.514 | 11.514 | 0.074 |
| DM_Precip | mm/day | 615 | 4.641 | 10.383 | 0.000 | 99.230 | 99.230 | 0.419 |
| DM_TempAvg | °C | 615 | 21.231 | 6.560 | -1.305 | 31.050 | 32.355 | 0.265 |
| DM_TempMax | °C | 615 | 26.276 | 6.535 | 1.620 | 36.510 | 34.890 | 0.264 |
| DM_TempMin | °C | 615 | 16.187 | 6.862 | -4.260 | 26.180 | 30.440 | 0.277 |
| CHLct | ug/l | 615 | 33.865 | 14.721 | 6.790 | 83.061 | 76.271 | 0.594 |
| CHLin | ug/l | 615 | 33.865 | 14.721 | 6.790 | 83.061 | 76.271 | 0.594 |
| NDVI | -1 to 1 | 615 | 0.476 | 0.222 | -0.023 | 0.929 | 0.953 | 0.009 |
| NDVI_d1 | -1 to 1 | 615 | 0.001 | 0.077 | -0.907 | 0.755 | 1.662 | 0.003 |
| NDVI_d16 | -1 to 1 | 615 | 0.042 | 0.209 | -0.624 | 0.945 | 1.569 | 0.008 |
| NDVI_d2 | -1 to 1 | 615 | 0.006 | 0.122 | -0.895 | 0.846 | 1.742 | 0.005 |
| NDVI_d32 | -1 to 1 | 615 | 0.079 | 0.225 | -0.879 | 0.726 | 1.605 | 0.009 |
| NDVI_d4 | -1 to 1 | 615 | 0.006 | 0.157 | -0.871 | 0.706 | 1.577 | 0.006 |
| NDVI_d8 | -1 to 1 | 615 | 0.023 | 0.187 | -0.824 | 0.884 | 1.708 | 0.008 |
| Winter_NDVI | -1 to 1 | 615 | 0.270 | 0.082 | 0.168 | 0.458 | 0.291 | 0.003 |
| Winter_Precip | mm/day | 615 | 3.075 | 0.697 | 1.676 | 4.002 | 2.326 | 0.028 |
| Winter_WaterTemp | °C | 615 | 3.806 | 1.458 | 1.696 | 6.109 | 4.413 | 0.059 |
| Winter_Salinity | ppt | 615 | 13.657 | 1.496 | 9.669 | 15.619 | 5.950 | 0.060 |
| Winter_Oxygen | mg/l | 615 | 12.817 | 0.391 | 12.343 | 13.602 | 1.259 | 0.016 |
| Winter_Chlorophyll | ug/l | 615 | 7.831 | 2.724 | 2.129 | 11.102 | 8.973 | 0.110 |
| Winter_AirTemp | °C | 615 | 2.872 | 1.954 | -0.404 | 5.983 | 6.387 | 0.079 |
Additionally, Figure 2 shows how some of the environmental variables relate to those measured at the hatchery. We observe correlations between the temperatures (ambient water and air temperatures are correlated with water temperature at the hatchery – Figure 2 A–C) and water salinities (Figure 2 D), but no visible relationship is observed between precipitation and hatchery salinity (Figure 2 E).
3 Machine learning methods
In our study, we employed data-driven machine learning methods to investigate the impact of various variables on hatchery production yield and to identify key predictors. The main method, which produced a predictive model, was the random forest. To select relevant variables and appropriate structure of the random forest model, we tried several variable selection techniques and combinations of random forest parameters. The technique for picking important variables and the model parameters were selected by assessing the resulting model accuracy on data that were deliberately left out (i.e., we cross-validated the selections we made). Finally, we used Shapley values to interpret the effects of individual variables represented in the model and decompose individual predictions of hatchery yields. The subsections below describe each of the methods in more detail and name the software packages we used. Overall, we used R 4.2.2 (R Core Team 2022) for the computations, and our full code can be accessed on GitHub.
3.1 Random forest
Random forest (Breiman 2001) is a versatile approach that builds predictive models for hatchery yields, given a training dataset with observed yields (response variable) and potentially related variables (predictors). The random forest algorithm picks a predictor and a threshold value that split the response values into the most homogeneous subsets (minimizing variance around the group means). Consecutive splits can use a different predictor or simply a different threshold of the same predictor. The resulting sequence of splits has the shape of a tree, where the root is the whole input dataset and the leaves are the final subsets of the data that are not split further (the minimal size of these subsets is usually called the nodesize). The average response value in a leaf is used as the tree's prediction. For example, to obtain a yield prediction from a tree, given a certain combination of predictors, the algorithm finds the leaf that corresponds to this combination and uses the average yield in that leaf as the prediction.
To grow multiple trees from the same input dataset and to make the trees more different from one another, each tree is grown on a bootstrap sample (sample with replacement from the original data), and only a random subset of size mtry of all the predictors is considered when selecting each new split in a tree. Predictions from all trees in the random forest are averaged to obtain a single prediction for a given combination of input variables. While using simple randomization and multiple simple splits, random forest is capable of modeling complex relationships between variables and providing accurate predictions.
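The mechanics above can be illustrated with a short sketch (Python/scikit-learn with synthetic data standing in for the hatchery records; the study itself used the R package ranger). Here `max_features` plays the role of mtry and `min_samples_leaf` the role of nodesize:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# synthetic stand-in for the hatchery data: 200 broods, 5 predictors,
# with "yield" driven by the first two predictors plus noise
X = rng.uniform(size=(200, 5))
y = 10 + 8 * X[:, 0] - 6 * X[:, 1] + rng.normal(scale=1.0, size=200)

rf = RandomForestRegressor(
    n_estimators=500,    # number of bootstrap-trained trees
    max_features=2,      # predictors tried at each split (the mtry analogue)
    min_samples_leaf=5,  # minimal leaf size (the nodesize analogue)
    random_state=0,
).fit(X, y)

# the forest prediction is the average of the individual tree predictions
tree_preds = np.array([tree.predict(X[:1]) for tree in rf.estimators_])
print(np.isclose(tree_preds.mean(), rf.predict(X[:1])[0]))  # True
```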
An advantage of the random forest algorithm is the low number of parameters that need to be tuned to improve the model structure and predictive performance. Most often, mtry and nodesize are tuned, while default values can be used for other parameters of the algorithm, for example,
- the size of the bootstrap samples (we used the default, equal to the sample size),
- case weights (we weighted all observations equally), and
- number of trees (we used the default of 500 since by reaching this number of trees the model errors plateaued and did not decrease further).
The nodesize controls the depth of the trees: smaller nodesize results in deeper and possibly overfitted trees (overfitted models provide a close fit to the data but often have poor performance on a new testing dataset), while larger nodesize results in shallower trees with better out-of-sample performance. However, random forest aggregates predictions from many overfitted trees, hence the effect of nodesize is often modest, and full trees can be grown. By default, nodesize = 5 when modeling a numeric response variable.
Low mtry corresponds to more different (decorrelated) trees, which results in lower variance but a higher bias of the random forest. In turn, high mtry increases the chance of the most relevant variable to be picked when growing the trees, which reduces the bias but increases the variance of the ensemble (Hastie et al. 2009). The default value of mtry is (rounded down) \(p/3\) or \(\sqrt{p}\) in different software packages, where \(p\) is the total number of predictors.
The mtry, nodesize, and other parameters mentioned here are sometimes called the ‘tuning parameters’ or ‘hyperparameters’ because they are set by the user and are not estimated from the input data directly (for example, compared to coefficients in statistical regression models). In this study, we used the cross-validation technique described below to consider several combinations of these parameters and make a data-informed decision. Specifically, we considered mtry of 5, 7, 10, 15, 20, 25, 30, 40, and 50; and nodesize of 1, 3, 5, and 10. We used the R package ranger (Wright et al. 2023) to train these random forest models.
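The tuning loop can be sketched as a grid search over the hyperparameters (again a Python/scikit-learn illustration on synthetic data, with shortened grids for speed; the study tuned mtry and nodesize in ranger in R):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 10))
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=150)

# try each (mtry, nodesize) analogue combination and keep the one
# with the best cross-validated R^2
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_features": [2, 4, 8], "min_samples_leaf": [1, 5, 10]},
    cv=10,
    scoring="r2",
).fit(X, y)
print(grid.best_params_)
```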
3.2 Identification of important predictors (a.k.a. variable selection)
We applied variable selection techniques to improve the performance of the random forest model and, as a result, identify important predictors of hatchery yield. If many of the \(p\) original predictors are irrelevant, there is a smaller chance that at least one relevant predictor appears in the random subset of mtry predictors considered for a new tree split (Hastie et al. 2009). Reducing the number of irrelevant predictors ensures that most subsets of size mtry will contain at least one relevant predictor for growing the tree. However, it is not necessary to remove absolutely all irrelevant predictors, since they might never be picked for a split as long as there is a better alternative in the mtry set (Hastie et al. 2009).
Because of the black-box nature of the method, the final random forest model might not employ all of the input variables, and the modeled relationships are not expressed in single coefficients (as they are in statistical linear regression models). We address this using the model interpretation techniques discussed below.
The importance of a predictor in a random forest is usually quantified as the average increase in prediction error when the values of this predictor are permuted. If the predictor is important, randomly permuting its values will worsen the prediction accuracy more than it would for an unimportant predictor. However, there is no inherent threshold to separate importance values into statistically significant or not, so additional methods have been developed.
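The permutation-based importance can be sketched in a few lines (a self-contained Python illustration with a toy model in place of a fitted random forest; everything here is hypothetical):

```python
import numpy as np

class ToyModel:
    """Stand-in for a fitted model: predicts y = 3 * x0 and ignores x1."""
    def predict(self, X):
        return 3 * X[:, 0]

def permutation_importance(model, X, y, rng):
    """Average increase in mean squared error after permuting a predictor."""
    base = np.mean((y - model.predict(X)) ** 2)
    importances = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # break the link to the response
        importances.append(np.mean((y - model.predict(Xp)) ** 2) - base)
    return np.array(importances)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0]
imp = permutation_importance(ToyModel(), X, y, rng)
print(imp)  # large for x0, exactly 0 for the irrelevant x1
```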
From a variety of methods for variable selection, we applied two best-performing algorithms developed specifically for random forest models: Boruta (Kursa and Rudnicki 2010) and interaction forest (Hornung and Boulesteix 2022). We also applied each of these algorithms recursively (until no irrelevant predictors were identified) and compared to the baseline when all predictors were used, without selection. This gave us five options for variable selection that we compared in a cross-validation study.
To decide on the significance of random forest predictors, the Boruta algorithm creates permuted copies of the original predictors. The algorithm then calculates the importances of these ‘shadow’ predictors and records their maximal, mean, and minimal values. Since the shadow predictors are inherently irrelevant (they are just random permutations), their importances represent the importance distribution of irrelevant predictors. The algorithm compares the maximal shadow importance with the importances of the real predictors, using pairwise t-tests adjusted for the number of comparisons. In our single-pass implementation of Boruta, we kept only those predictors whose importances were significantly higher than the maximal shadow importance (i.e., we removed predictors with importances below or not significantly different from the shadow ones). In our recursive implementation of Boruta, we removed only rejected predictors (i.e., those with importances significantly lower than the shadow ones) and repeated the selection until no more predictors could be rejected. We used the R package Boruta (Kursa and Rudnicki 2022) with its default settings for applying this algorithm.
The algorithm of interaction forest presents a new effect importance measure (EIM) that allows us to rank the importances of individual predictors (similar to the more traditional approaches described above) and to consider bivariate splits (interaction effects) and their importance. In our simple implementation of interaction forest, all original predictors were used, without variable selection, hence only the selection of splits differed from the conventional random forest. In our recursive implementation of the interaction forest, we removed predictors with univariate EIM < 0 and retrained the interaction forest. We also considered the option of keeping predictors that resulted in bivariate EIM > 0 (positive importance of interaction effects), but it did not result in reducing the number of predictors from the original number. We used the R package diversityForest (Hornung and Wright 2023) to train the interaction forests.
3.3 Cross-validation for performance evaluation
To compare different options of random forest algorithms and variable selection and select one for the final model, we used cross-validation on the available data. In cross-validation, only a portion of the data is used for training the model (a.k.a. training set) while a hold-out portion (a.k.a. validation set) is treated as new data that can be used to assess real-life predictive performance of the model.
Specifically, we used \(k\)-fold cross-validation, where the input cases are randomized and separated into \(k\) equal folds (subsets). We can use the folds \(2, \ldots, k\) to train the model, and predictors from fold 1 to make yield forecasts for fold 1 (the validation fold). The procedure is repeated until all \(k\) folds are used as validation folds. We used \(k = 10\) and the R package caret (Kuhn 2023) to separate the data into folds so the same randomization is used to compare competing options.
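The fold construction can be sketched as follows (a Python illustration; the study used the R package caret, and the fold sizes below simply reflect the 615 records in Table 1):

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle case indices and split them into k (nearly) equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

rng = np.random.default_rng(42)
folds = kfold_indices(615, 10, rng)

# each fold serves once as the validation set while the rest train the model
for v in range(len(folds)):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != v])
    valid_idx = folds[v]
    # ... fit the model on train_idx, forecast yields for valid_idx ...

print([len(f) for f in folds])  # five folds of 62 cases and five of 61
```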
The forecasts on the validation folds are considered out-of-sample forecasts since the corresponding data were not used for model specification and training. Let \(e_i\) denote the cross-validation error, which is the difference between true yield (\(y_i\)) and the forecast (\(\hat{y}_i\)): \[ e_i = y_i - \hat{y}_i, \] where \(i = 1, 2, \ldots, n\) and \(n\) is the sample size. We evaluated the performance of alternative models using three metrics: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (\(R^2\)), which were calculated as follows: \[ MAE = n^{-1} \sum |e_i|, \]
\[ RMSE = (n^{-1} \sum e_i^2)^{1/2}, \]
\[ R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}, \] where \(\bar{y} = n^{-1} \sum y_i\) is the average observed yield. Smaller errors (MAE and RMSE) are preferred, but RMSE penalizes large individual errors \(e_i\) more than MAE. Higher \(R^2\) is preferred as it shows the percentage of yield variability captured with the forecasts. The performance metrics MAE, RMSE, and \(R^2\) generally provided consistent results, but \(R^2\) was used for making final selections.
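The three metrics can be computed directly from the definitions above (a Python sketch with invented yields and forecasts; the study computed them in R):

```python
import numpy as np

def cv_metrics(y, y_hat):
    """MAE, RMSE, and R^2 from out-of-sample forecasts."""
    y = np.asarray(y, dtype=float)
    e = y - np.asarray(y_hat, dtype=float)
    mae = np.mean(np.abs(e))
    rmse = np.sqrt(np.mean(e ** 2))
    r2 = 1 - np.sum(e ** 2) / np.sum((y - np.mean(y)) ** 2)
    return mae, rmse, r2

# hypothetical observed and forecast yields (%)
y = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])
mae, rmse, r2 = cv_metrics(y, y_hat)
print(round(mae, 3), round(rmse, 3), round(r2, 3))
```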
To identify statistically significant differences between different algorithms, we also used paired \(t\)-tests to compare average performance metrics from \(k\) cross-validation folds. The \(p\)-values from these tests were adjusted for the number of pairwise comparisons using the Benjamini–Hochberg procedure to control the false discovery rate (Benjamini and Hochberg 1995). The tests with the adjustment for multiple testing were implemented using the R packages stats (R Core Team 2022) and rstatix (Kassambara 2023a), and visualized with boxplots using the R package ggpubr (Kassambara 2023b).
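The Benjamini–Hochberg adjustment used for the pairwise comparisons can be sketched as follows (a self-contained Python version for illustration; the study used the R packages stats and rstatix):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Adjusted p-values controlling the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_min = 1.0
    # step down from the largest p-value, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adj[i] = running_min
    return adj

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```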
In this study, we first ran a cross-validation to compare the five alternative algorithms of variable and split selection (see Section 3.2). For this first round of cross-validation, we used the set of records for which all the predictors were available (see Table 1 for the number of available values for each variable). Once variables were selected, we could redefine the dataset: if the selection removed predictors with missing values, additional records became complete. Hence, for tuning the nodesize and mtry (see Section 3.1) in a second round of cross-validation, we used a dataset of complete observations that was at least as large as the one used in the first round.
3.4 Model interpretation
Model interpretation techniques enhance our understanding of black-box models by providing insights into the relationships learned by the model and into which of the input variables affected the predictions most. Partial dependence plots are a typical way to demonstrate pairwise predictor–response relationships represented in a random forest (Hastie et al. 2009); however, these plots do not explain individual predictions. In this study, we use another method, based on the Shapley decomposition of the contributions of individual predictors (Molnar 2022) and adapted to compute fast (approximate) Shapley values for explaining individual predictions in tree-based models like the random forest (Štrumbelj and Kononenko 2014).
The SHAP (SHapley Additive exPlanations) values \(\phi_j\) by Lundberg et al. (2020) have the additivity property such that the forecast of yield for a given vector of inputs \(x\) can be decomposed as a sum (Molnar 2022): \[ \hat{y}(x) = \phi_0 + \sum \phi_j, \] where \(\hat{y}(x)\) is the forecast from the random forest model, \(\phi_0 = \mathrm{E}(\hat{y}(X))\) is the baseline set to the average prediction across all cases, and \(j\) is from 1 to the number of predictors \(p\).
First, we use average absolute values of SHAP, denoted as mean(|SHAP value|) in the figures below, to rank the importance of predictors, analogous to using the permutation-based importance in random forest. The SHAP importance shows how much each variable affects the predictions in the dataset, on average.
Second, we provide SHAP-based partial dependence plots. Each of the plots shows SHAP values for a given predictor, hence demonstrating the general pattern of the relationships (similar to the partial dependence plots) and variability.
Third, we summarize the explained effects for subsets of the cases. We are particularly interested in the well-predicted cases (i.e., where the model accuracy was high, with the absolute error below 3 percentage points). To analyze environmental and hatchery conditions responsible for the crashes or high yields, we select well-predicted crashes or well-predicted yields, correspondingly.
To compute the Shapley additive explanations, we used the R package fastshap (Greenwell 2023) and the package shapviz (Mayer 2023) to visualize the results.
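The additivity property described above can be checked with a tiny sketch (all numbers here are invented for illustration and are not values from the fitted model):

```python
# hypothetical decomposition of one yield forecast (percentage points)
phi0 = 9.4  # baseline: average prediction across all cases
shap_values = {"Week": 2.1, "Winter_Salinity": 1.3,
               "Fecundity": -0.8, "GR_Turbidity": -0.5}

# additivity: forecast = baseline + sum of per-predictor contributions
forecast = phi0 + sum(shap_values.values())
print(round(forecast, 2))  # 11.5
```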
4 Results
4.1 Variable selection
From Figure 3, variable selection using the algorithms Boruta, recursive Boruta, and recursive interaction forest led to improved performance (lower errors and higher \(R^2\)) compared with not using a selection algorithm (i.e., using all the variables) in a random forest. For the \(p\)-values of the tests comparing the performances of different algorithms, see Section 6.1. The recursive implementation of Boruta consistently produced the largest improvement over using all the variables; however, its performance was not significantly different from that of Boruta or the recursive interaction forest (Figure 3). Recursive Boruta generally selected fewer variables than the recursive interaction forest but more than Boruta (Figure 4). The interaction forest (non-recursive) ended up using all variables (Figure 4), and hence its results did not differ from using all variables (Figure 3). Overall, we proceeded with the recursive implementation of the Boruta algorithm for selecting relevant predictors in the model.
The recursive Boruta algorithm, when applied to the whole dataset, selected 56 predictors that we use in the subsequent analysis (Figure 5).
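The recursive application of Boruta can be sketched as below; `X` and `y` denote the predictors and the yields, and the stopping rule shown here (repeat until no further variables are dropped) is our reading of the recursive scheme, not the verbatim implementation.

```r
library(Boruta)

selected <- colnames(X)
repeat {
  # Re-run Boruta on the currently selected predictors only
  b <- Boruta(X[, selected, drop = FALSE], y)
  keep <- getSelectedAttributes(b, withTentative = FALSE)
  if (length(keep) == length(selected)) break  # selection has stabilized
  selected <- keep
}
```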
4.2 Summary of the model
From Figure 3, random forest models using the recursive Boruta algorithm for selecting variables achieve an MAE of 6.0 percentage points, an RMSE of 7.9 percentage points, and an \(R^2\) of 27.9% on out-of-sample data (data withheld from model construction and used for validation).
With the selected variables and a refined dataset in which all 56 selected predictors are present (\(n =\) 459, which may differ from the summary reported in Table 1), we further improved the model by tuning its hyperparameters. The resulting model was selected based on the highest \(R^2\) in 10-fold cross-validation; it used \(mtry =\) 20 randomly selected predictors at each split of a tree and grew trees with terminal nodes as small as \(nodesize =\) 5 observations.
The cross-validation \(R^2\) of this tuned model was 34.5%, with an MAE of 5.7 percentage points and an RMSE of 7.7 percentage points. The average predicted yield from this model (9.4%) was close to the average observed yield (9.3%) in the dataset (Figure 6).
The in-sample \(R^2\) of the tuned model was 87.8%, with an MAE of 2.4 percentage points and an RMSE of 3.3 percentage points. That is, the model explained 87.8% of the hatchery yield variability in the sample.
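A tuning step of this kind can be sketched with the caret package; the grid of `mtry` values below is illustrative rather than the exact grid we searched, and `X` and `y` are again the predictors and the yields.

```r
library(caret)

# 10-fold cross-validation, selecting the model with the highest R-squared
ctrl <- trainControl(method = "cv", number = 10)
tuned <- train(
  x = X, y = y,
  method = "rf",
  trControl = ctrl,
  metric = "Rsquared",
  tuneGrid = expand.grid(mtry = c(10, 20, 30)),
  nodesize = 5  # passed through to randomForest()
)
```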
4.3 Shapley
4.3.1 Global
Using Shapley values, we provide insights into the model. Figure 7 compares the average magnitudes of the effects of each variable, showing that the week number, winter NDVI, winter salinity, fecundity, and recent ambient salinity and turbidity are among the top determinants of hatchery yields.
Figure 8 further reveals global patterns and individual variability of the effects of the top variables. It shows that higher yields are typically observed for broods started before week 30 (end of July), after a winter with high water salinity and low NDVI, with smaller fecundity at spawning, ambient salinity above 10, and low ambient turbidity. Note that some of the effects are more neutral than others. For example, starting production during weeks 10–15 does not necessarily lead to a large improvement, because Shapley values for those weeks are around 1, but starting late in the year, such as in week 40, may lead to yields 4–6 percentage points lower than the baseline. Similarly, turbidity above 2 generally corresponds to lower yields, but turbidity below 2 corresponds to yield improvements as high as 3 percentage points. See Figure 13 for partial dependence plots for all the variables.
4.3.2 Crashes
We further investigated model insights for the cases of hatchery crashes. We selected 108 well-predicted crashes (cases with yields below 1% and absolute errors of the model below 3 percentage points). For these cases, Figure 9 shows that week and winter NDVI are still the two most important variables, but the next five most important variables relate to water salinity, such as the ambient winter salinity, the salinity before filling up the tank, and the conditioning salinity. Note that the Shapley values are additive, such that their sum matches the model output. This property is demonstrated in Figure 10, which explains how the model created each of the predicted values \(f(x)\), given that the baseline average predicted yield was 9.4%.
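The additivity underlying such explanations can be written as
\[
f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j(x),
\]
where \(\phi_0\) is the baseline (here, the average predicted yield of 9.4%), \(\phi_j(x)\) is the Shapley value of the \(j\)th predictor for the given case, and \(p\) is the number of predictors. Thus, each prediction decomposes exactly into the baseline plus the per-variable contributions.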
4.3.3 Highest yields
Similarly to the previous section, we selected cases of well-predicted high yields, with yields above 20% and absolute errors of the model below 3 percentage points (17 cases). For these cases, the rankings of influential variables (Figure 11) appear to be considerably perturbed compared to the rankings in Figure 7 and Figure 9. For example, Figure 11 shows several turbidity indicators as more important, along with fecundity, NDVI, and pH. This might be related to the fact that these cases of high yields already correspond to the favorable season, with weeks from 15 to 29 (average 23), and to lower winter NDVI, from 0.22 to 0.26 (average 0.24). Figure 8 shows that the Shapley effects over these intervals of the predictors are positive but not exceptionally high. Figure 12 gives several example explanations of how the model arrived at its accurate predictions of high yields.
5 Conclusion
This analysis demonstrated one possible pathway for using observational data to study the effects of factors for improving oyster hatchery production. By applying machine learning to the data, we were able to process a large number of predictors, build and evaluate a variety of models, and extract data patterns that helped the model achieve accurate predictions of yields. From these patterns, we can identify conditions that lead to lower yields, which a hatchery should try to avoid (e.g., starting production late in the season or at low salinity), as well as conditions corresponding to improved yields (e.g., low fecundity and low turbidity).
Acknowledgements
We thank Charles Pellerin (ERT contractor working at NOAA) for providing the buoy data. This research was supported by awards from the Chesapeake Bay Trust, the Chesapeake Oyster Alliance, and the Harry R. Hughes Center for Agro-Ecology.
References
6 Appendix
6.1 Pairwise comparisons of the cross-validation results
Group averages of MAE:
#> All Boruta BorutaRec
#> 6.194901 5.931658 5.964305
#> InteractionForest InteractionForestRec
#> 6.257320 5.945040
Group averages of RMSE:
#> All Boruta BorutaRec
#> 8.075456 7.912044 7.918358
#> InteractionForest InteractionForestRec
#> 8.131388 7.877465
Group averages of \(R^2\):
#> All Boruta BorutaRec
#> 0.2509084 0.2788968 0.2794493
#> InteractionForest InteractionForestRec
#> 0.2420665 0.2866606
Group averages of number of variables selected:
#> All Boruta BorutaRec
#> 144.0 45.2 52.1
#> InteractionForest InteractionForestRec
#> 144.0 56.1
Pairwise comparisons:
#>
#> Pairwise comparisons using paired t tests
#>
#> data: MAE and Version
#>
#> All Boruta BorutaRec InteractionForest
#> Boruta 0.0012 - - -
#> BorutaRec 0.0012 0.2441 - -
#> InteractionForest 0.2083 0.0012 0.0012 -
#> InteractionForestRec 0.0124 0.8543 0.8543 0.0012
#>
#> P value adjustment method: BH
#>
#> Pairwise comparisons using paired t tests
#>
#> data: RMSE and Version
#>
#> All Boruta BorutaRec InteractionForest
#> Boruta 0.034 - - -
#> BorutaRec 0.016 0.904 - -
#> InteractionForest 0.322 0.061 0.029 -
#> InteractionForestRec 0.034 0.794 0.742 0.016
#>
#> P value adjustment method: BH
#>
#> Pairwise comparisons using paired t tests
#>
#> data: R2 and Version
#>
#> All Boruta BorutaRec InteractionForest
#> Boruta 0.047 - - -
#> BorutaRec 0.016 0.957 - -
#> InteractionForest 0.420 0.085 0.041 -
#> InteractionForestRec 0.047 0.756 0.756 0.016
#>
#> P value adjustment method: BH
6.2 Partial plots
Citation
@report{lyubchich2023,
author = {Vyacheslav Lyubchich and Matthew W. Gray and Greg M. Silsbe},
publisher = {University of Maryland Center for Environmental Science},
title = {Machine Learning of Factors for Improving Oyster Hatchery
Production},
date = {2023-10-13},
address = {Cambridge, Maryland, USA},
url = {https://vlyubchich.github.io/OysterHatcheryYield},
doi = {https://doi.org/10.5281/zenodo.8327628},
langid = {en}
}